Defining Loci in Restriction-Based Reduced Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics for Optimal Clustering

Authors

  • Daniel C. Ilut
  • Marie L. Nydam
  • Matthew P. Hare

Abstract

Next generation sequencing holds great promise for applications of phylogeography, landscape genetics, and population genomics in wild populations of nonmodel species, but the robustness of inferences hinges on careful experimental design and effective bioinformatic removal of predictable artifacts. Addressing this issue, we use published genomes from a tunicate, stickleback, and soybean to illustrate the potential for bioinformatic artifacts and introduce a protocol to minimize two sources of error expected from similarity-based de-novo clustering of stacked reads: the splitting of alleles into different clusters, which creates false homozygosity, and the grouping of paralogs into the same cluster, which creates false heterozygosity. We present an empirical application focused on Ciona savignyi, a tunicate with very high SNP heterozygosity (~0.05), because high diversity challenges the computational efficiency of most existing nonmodel pipelines while also potentially exacerbating paralog artifacts. The simulated and empirical data illustrate the advantages of using higher sequence difference clustering thresholds than is typical and demonstrate the utility of our protocol for efficiently identifying an optimum threshold from data without prior knowledge of heterozygosity. The empirical Ciona savignyi data also highlight null alleles as a potentially large source of false homozygosity in restriction-based reduced representation genomic data.
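The two clustering errors named in the abstract can be illustrated with a toy example. The sketch below is not the authors' protocol or pipeline; it is a minimal single-linkage clustering of equal-length reads by mismatch count, and the sequences, the `mutate` helper, and the thresholds are all hypothetical values chosen for illustration. Two alleles of one locus differ at 3 sites, and a paralog differs from the first allele at 6 sites.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def mutate(seq, changes):
    """Return a copy of seq with the given {position: base} substitutions (hypothetical helper)."""
    bases = list(seq)
    for pos, base in changes.items():
        bases[pos] = base
    return "".join(bases)


def cluster(seqs, max_mismatches):
    """Single-linkage clustering: a read joins every cluster that holds
    at least one sequence within max_mismatches of it."""
    clusters = []
    for s in seqs:
        merged = [s]
        unlinked = []
        for c in clusters:
            if any(hamming(s, t) <= max_mismatches for t in c):
                merged.extend(c)   # s links this cluster to the merged group
            else:
                unlinked.append(c)
        clusters = unlinked + [merged]
    return clusters


# Toy data (assumed, not from the paper).
allele_1 = "ACGTACGTACGTACGTACGT"
allele_2 = mutate(allele_1, {0: "T", 5: "G", 10: "A"})           # 3 mismatches
paralog = mutate(allele_1, {1: "A", 3: "A", 7: "C",
                            12: "C", 15: "G", 18: "C"})          # 6 mismatches
reads = [allele_1, allele_2, paralog]

for threshold in (2, 4, 6):
    print(threshold, len(cluster(reads, threshold)))
```

With these toy values, a threshold of 2 leaves the two alleles in separate clusters (each would look homozygous), a threshold of 4 groups the alleles while keeping the paralog apart, and a threshold of 6 pulls the paralog into the same cluster (which would look like extra heterozygosity).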

Similar resources

Genetic Polymorphism at MTNR1A, CAST and CAPN Loci in Iranian Karakul Sheep

Genotypes for melatonin receptor type 1A (MTNR1A) and Calpastatin (CAST) were determined by enzymatic digestion of PCR products, and the Calpain (CAPN) genotype was detected by the PCR-SSCP method in Iranian Karakul sheep. Blood samples were collected from 100 purebred Karakul sheep. The extraction of genomic DNA was based on the guanidinium thiocyanate-silica gel method. PCR amplicons were digested with restri...

Predicting RAD-seq Marker Numbers across the Eukaryotic Tree of Life

High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes--generically known as restriction site associated DNA sequencing (RAD-seq)--is a common strategy to generate genome-wide genotypic and sequence data from eukaryotes. A critical design element of any RAD-seq study is knowledge of the approximate number of genetic markers that can be...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout.

The increased numbers of genetic markers produced by genomic techniques have the potential to both identify hybrid individuals and localize chromosomal regions responding to selection and contributing to introgression. We used restriction-site-associated DNA sequencing to identify a dense set of candidate SNP loci with fixed allelic differences between introduced rainbow trout (Oncorhynchus myk...

Amplification of whole tumor genomes and gene-by-gene mapping of genomic aberrations from limited sources of fresh-frozen and paraffin-embedded DNA.

Sufficient quantity of genomic DNA can be a bottleneck in genome-wide analysis of clinical tissue samples. DNA polymerase Phi29 can be used for the random-primed amplification of whole genomes, although the amplification may introduce bias in gene dosage. We have performed a detailed investigation of this technique in archival fresh-frozen and formalin-fixed/paraffin-embedded tumor DNA by using...

Volume: 2014

Publication date: 2014